Abstract
In this take home work, you will have the opportunity to look over in depth the example of webscraping presented within the session. While this example won’t teach you R, it will attempt to guide you through some of the basics so that you understand what is going on and how one would do web-scrapingR is a language. Remember this at all times. You do not learn languages in one day. Some people learn languages via apps. Some people learn languages with teachers in classes. Some people learn languages through immersion. Remember this is it feels challenging.
This lesson on web-scraping is not going to leave you fluent in R. It is more like a repeat after me sing-along. It will teach you the words and show you how to say them and you will definitely understand how songs are sung after this but it doesn’t mean you will be able to go and write your own song without help. If you want that, we recommend taking more classes on R.
R is a programming language designed for statistics and RStudio is a code editor (an IDE: integrated development environment) where you can work with R. This work assumes you already have them both installed as well as Google Chrome.
Let’s start by opening up RStudio and covering some basics. Once you open RStudio (not R 4.2.1, or whatever version you have installed), there will be a window with three panes like this:
The left side has the console, where R code is run and all the magic
happens. Top right is the Environment and history pane. These is where
you will see the things you make when you run the code in the console.
Bottom right shows multiple tabs including Plots (where graphs you make
are shown) and Help (where you can search for manuals on any function
you use).
There is one more pane we need to add to this before we start. As you code, you want to keep track of all the code you write and execute. To do that, we create an R Script. An R Script is simply a text file that stores the commands (code) you run in the console. It is a journal, if you will. To open a new one, click the image of the white paper with the green plus symbol and select R Script from the drop down. A new window will appear in the top left with a blank page.
Save your script in a folder you would like to work and name the file “webscrape.R”. Avoid using spaces in your filenames when coding as often computers have problems with them. Use an underscore instead.
In R, we work with data but how do we store that data. R stores information as ‘objects’. Objects have a name, and can contain everything from a single number, a string of letters, a table of data or some program code. You can think of objects as containers. Containers can be file folders, filing cabinets, book shelves, tote bags, bin bags. They all store things in different ways. Thinking back to maths classes, we took long ago, objects are like the variables we use in formulas to solve equations. Like Pythagorean theorem: \(a^2+b^2=c^2\), \(a\), \(b\), and \(c\) are the variables/objects that we replace with information to get some answer.
In R you assign a value to an object with <- or
=. The hardest part of objects is naming them. There are
numerous naming conventions, but the key is to not use spaces, do not
start with numbers, and have them be meaningful. If your object is a
list of information about teapots, name the object
teapot.
R comes with a lot of functionality built in everyything included in a fresh install of R is known as “base R”. The best part of R is the add-ons called packages you can install within R. Packages contain functions and functions are how things get done in R. Functions are prebuilt code that can find the mean of some numbers to running complicated computer simulations thousands of times to simulate randomness. If you have used Excel, they are similar to excel functions in usage.
For this work, we are only going to need one package, rvest. Let’s install it now via coding. Copy the line below to your R script (the top left pane) and then, while your cursor is on the line (selected), press either the Run button in the top right of the script pane or CTRL+ENTER. This will run the line of code in the console pane below.
install.packages("rvest")
You will receive a message about the package being successfully ‘unpacked’ (installed). This simply installs the package on the computer, however, like computer applications you need to ‘open’ them to use them. To ‘open’ a package so we can use it, you load a package into your library, so R recognises the functions you are typing are the ones in this package. The first line in most scripts will be loading packages to your library, as they must be done everytime you work with R.
library(rvest)
This is the start of going through the webscraping example used in the workshop. Copy each line of code into your script in RStudio and run it using either the run button or CTRL+Enter while that line is selected. Try to understand what each line is doing when you run it. The complete code will be available at the end of the exercise.
We start with a link to a webpage. In this case, the “url” (link or web address) is to the search results on the Fitzwilliam Museum’s catelogue for the term “pottery” and filtered by department, “Applied Arts”.
Our goal is to go through all the pages available and find the links to each of the objects listed here that you would normally just click on to follow to the object collection page.
Let’s tackle th “all the pages available” part first as it is easy in this case. On the Fitzwilliam website, as you click the next button to go to the next page, the url changes slightly to add “page=2”. If you edit the url and change the number 1 or 42 or 101, it will take you to that page. This means all we need to do to obtain all the pages is look at the number of pages there are and create links that change the “page=#” to a different number.
Store the url as an object called url. Quotes tell R that object is to be treated as a string of characters.
url <- "https://data.fitzmuseum.cam.ac.uk/search/results?query=pottery&operator=AND&sort=desc&department=Applied%20Arts&page="
Next we create an object to store the page numbers for us. For sake of speed and your computer power, let’s only pretend there are 3 pages. A colon between two numbers in R means give me all the numbers from this number to that number.
num_pages = 1:3
If you type, just the name of the object and run that, it will show you what is stored in the object in the console like so:
num_pages
## [1] 1 2 3
Use this technique to double check your work as you go and ensure something hasn’t gone wrong along the way
Now we combine the list of numbers with the url we have made using
our first function. Functions follow the convention of
function_name(). The function has “arguments”, the data you
need to give the function to do its purpose. In this case you give it
the url object and the num_pages which
it will take both and combine them together into one string of
characters.
coll_pages <- paste0(url, num_pages)
## [1] "https://data.fitzmuseum.cam.ac.uk/search/results?query=pottery&operator=AND&sort=desc&department=Applied%20Arts&page=1"
## [2] "https://data.fitzmuseum.cam.ac.uk/search/results?query=pottery&operator=AND&sort=desc&department=Applied%20Arts&page=2"
## [3] "https://data.fitzmuseum.cam.ac.uk/search/results?query=pottery&operator=AND&sort=desc&department=Applied%20Arts&page=3"
We have all the links but it is much easier to start with just one
and make sure we figure it out there first. Square brackets allow you to
access the information within an object stored as a list or vector. So
this gets us the first of the 3 url in coll_pages.
page_one <- coll_pages[1]
## [1] "https://data.fitzmuseum.cam.ac.uk/search/results?query=pottery&operator=AND&sort=desc&department=Applied%20Arts&page=1"
With our url in hand, we need to fetch the HTML for the page. We are
reading all the code that makes up the site page you saw in the workshop
and storing it in an object with the function read_html()
which only requires you give it a object that has a string of characters
that can be read as a website url.
collect_webpage = read_html(page_one)
Step one: Complete! You have the code for a webpage on your computer!
We have the code, now let’s take it a part and find what we need shall we? For this we need to look at the code on the internet and find the element we need and its CSS selector.
Where is the link? Go back to the search results page. If you click, on the image of an object or the name, it takes you to the page for it, so we know that whatever code makes that up has to contain the link to the page. Just like if you make a hyperlink in a document you need to provide the url for it to work.
Ensuring you are in Google Chrome, press CTRL+U on the results page
and a new page will open with the page’s source code. Press CTRL+F and
type in the name of an object on this page. The image below shows the
results for “Venus” The link stored in
<a> is
the same as the URL for the object page and what we need. As there are
many <a> elements in the code how do we find the ones
we need? We use a CSS selector. Don’t worry about the code that makes up
a CSS selector, just remember that it is basically a pointer to a bit of
code.
We will use a tool called “Selector Gadget” to get the selector for this element. You can either add the Google Chrome Extension here or follow the instructions on this page to use.
Assuming you installed the extension, on the results page, click the
selector gadget extension symbol to activate it then click the name of
one of the objects with the link. Like so: Next, click all
the things you don’t want to be selected, as shown below.
Notice how with
each deselection the CSS selector at the bottom of the screen changes.
We end up with
".mb-3 .lead a". With this, it appears that
only the titles are selected, but if you look at the number next to the
clear button it will say 29. This is odd because there are only 24
objects on the page. In this case, there is a hidden menu messing with
us. If you turn off selector gadget by pressing the extension button
again, within the search box, open the “Filter your Search” menu. If you
turn on select gadget and click one of the words, you will notice a
similar selection of yellow boxes appear and 5+24 = 29. We have found
our culprit. Only problem is you can’t interact with anything else when
in the “Filter your Search” menu and selector gadget.
We need to know what each element is to add it to our CSS selector.
We will use Google Chrome’s inspect option. Right-click on an object
name and click inspect from the menu. A pane will appear on the side of the screen. You
will likely have
<a> selected. If you hover on the
level above you will see <h3.lead> has a similar
space. Open the filter menu and inspect the “Department”. It is
similarly an <a> but if you hover over the level
above, it is instead a <h5.lead>.
So, we want h3
not h5 and can add that to our selector to be
".mb-3 h3.lead a". With our selector in hand, we can run a
function that will select all HTML elements from
collect_webpage that match our selector.
elements = html_elements(collect_webpage, ".mb-3 h3.lead a")
Now we have all the elements with our page link. We need to retrieve
them. If you go back to the source code we looked at in the beginning
the url is stored as part of the element as <a href=.
“href” is what is called an attribute of the element
<a>. As well, the url is already in quotes, meaning
it is a string of characters, and doesn’t need to be cleaned to be
understandable.
pages = html_attr(elements, "href")
head(pages)
## [1] "https://data.fitzmuseum.cam.ac.uk/id/object/76487"
## [2] "https://data.fitzmuseum.cam.ac.uk/id/object/17525"
## [3] "https://data.fitzmuseum.cam.ac.uk/id/object/201797"
## [4] "https://data.fitzmuseum.cam.ac.uk/id/object/71313"
## [5] "https://data.fitzmuseum.cam.ac.uk/id/object/75738"
## [6] "https://data.fitzmuseum.cam.ac.uk/id/object/11708"
The function head() allows you to look at the first 6 entries in an objects.
We now have all the links on one page of our results! Take a second to pat yourself on the back you have webscraped and that is worth a congratulations
But we must press forward and we don’t just want one page we want all
the pages (in this case 3 pages). We will now take the list of results
pages we made coll_pages and the steps we used to retrieve
the links and automate it. Through the power of loops!
Loops are used to execute a group of instructions or a block of code multiple times, without writing it repeatedly. Picture a flowchart where it asks you a question and if you say yes you continue or if you say no it goes back to the beginning. These are loops. There are 3 types: For-Loops, While-Loops, and If-statements. We will be using For-loops and If-statements in this work.
For-loops are iterative (repeated in a sequence) conditional
statements. They follow the form of
for (variable in vector) {}. Picture this you have a bag of
apples, some are green and some are red, you need to check the colour of
each. You pick each apple out of the bag one at a time and say what
colour it is until the bag is empty; for the apple in the bag, say the
color:
for (apple in bag) {
print(apple)
}
While-loops are conditional statements that say that while a certain thing is true, do these instructions. While the apple is red, keep picking out apples from the bag. As soon as you draw a green apple, the loop stops.
If-statements are conditional statements that you may have run across in math or any sort of work that involves logic. If-statements are TRUE/FALSE (called Boolean) conditions. If the apple is red, then eat the apple. It can also have an additional at the end called an Else-statement. Continuing on with apples. Otherwise (ELSE), say “I hate green apples!”
if(apple == "red"){
print("I am going to eat this apple")
} else {
print("I hate green apples!")
}
Back to our url links! We want to go through all the urls in
coll_pages and extract the links. This is suited for a
For-loop. For each page in the list of pages, get me the data. The lines
with # are comments that are ignored by R when you run
# A new list for storing our links
all_pots = list()
for (webpage in coll_pages) {
# Fetch the webpage
collect_webpage = read_html(webpage)
# Collect the elements storing the links
elements = html_elements(collect_webpage, ".mb-3 h3.lead a")
# Get the attributes with the link
pages = html_attr(elements, "href")
# Add the links to the list
all_pots = append(all_pots,pages)
# Pause for 1 second before continuing
Sys.sleep(1)
}
Most of this should be clear and once you run it, try
length(all_pots)which should return 72 meaning there are 72
urls in the list. Let’s cover the additional lines here. As the For-loop
does the same steps every time, you need to make an object to store your
results outside of the loop. We have made an empty list called
all_pots to do that. Append() is a function that adds a new
entry to your list. You give it the list and the thing you are adding to
the list. Finally, Sys.sleep() tells the system to wait a moment before
going again. As discussed in the workshop, this is to ensure you are not
overloading the server.
Now to the objects!
Similar to retrieving the webpages, it is best to start a webscrape by using a known webpage, making sure everything works along the way then automating. I have chosen a page that contains all the information we are hoping to scrape; a wonderful pipe in the form of a fish: https://data.fitzmuseum.cam.ac.uk/id/object/76905
Make sure you have selected what information you want prior to starting, as retrieving different types of data can be like miny puzzles in a bigger one and it can feel very frustrating to put most of the puzzle together and then be told you need to more the puzzle to a bigger table. Data plans are your friend. We are going to retrieve:
As well, we will add two final categories we want in our dataset: ID and website. ID is helpful so you have a unique number for each object and the website is for ease of use when checking data later.
Before we continue we do need to think about the end product. In this case, we are preparing data to be used in Kumu. Kumu accepts data as a CSV (comma-separated values), which is the simplest form of spreadsheet. It allows for no formatting, just data. Kumu also says you can have multiple values in a cell but they have to be separated by “|” (vertical bar symbol).This in fact makes it easier for us as we know that the number of values varies for categories. Some objects have multiple dating periods or materials used in production. So we can just put them all in one cell rather than thinking about separating unknown numbers of values.
More questions and problems will arise as we go but that is enough to get us going.
We begin the same as before by taking our url and fetching the HTML. This time we will just directly add the url to the function as an argument.
obj_page = read_html("https://data.fitzmuseum.cam.ac.uk/id/object/76905")
We now want to find the elements that store all of the items we need.
However, it appears to not be as simple as the last one. Elements are
stored with very little information connected to them so many different
elements appear under the same CSS selector. Play around with selector
gadget to see what you get.
To solve this, we select one element that stores all the object information first then we can work on our data.
obj_info = html_elements(obj_page, ".object-info")
Now here comes the difficult bit, do not be afraid as we will go over what it does, but we need to select for each element now containing the information and it is more complex than the first one. But remember it uses the same function we used before and same input. The function html_text2() takes the text for the elements given and formats it in a readable way.
title = html_elements(obj_info,xpath = "//h3[preceding-sibling::h3[contains(text(), 'Title')
]][1]/preceding-sibling::*[preceding-sibling::h3[contains(text(), 'Title')]]")
html_text2(title)
## [1] "Pipe in the form of a fish"
The argument xpath= is similar to CSS selector but for
more complicated problems. In this case, when you look at the HTML for
the title and most of the elements you want. the elements are not nested
within each other but listed one after another. Visualise it as:
Versus:
This means we need to use placement of the elements next to each
other to find the ones we want. CSS selectors can do this in a way. For
Description, the CSS selector is .collection:nth-child(10).
nth-child(10) refers to the child
element (HTML uses child and parents as terms to indicate relationships
like a family tree) that is the tenth one down. But what if there is no
title?
This [lovely sauce boat][https://data.fitzmuseum.cam.ac.uk/id/object/72197] does
not have a title. The CSS selector for Description is
.collection:nth-child(8) It is now the eighth child element
down. So clearly, we cannot use the position number that CSS selectors
give. There are many elements that may or may not be included in object
data and could be multiple or only one, like Dating or School/Style. We
need to use relative position based on the text in each element to parse
out our data.
xpath = "//h3[preceding-sibling::h3[contains(text(), 'Title')]][1]/preceding-sibling::*[preceding-sibling::h3[contains(text(), 'Title')]]")
Xpaths can give us relative position and this code translates: look
for the element <h3> that contains the text “Title”
and give us all the elements that follow it, but stop when you reach the
next element <h3> that is after the element
<h3> which is called “Title. A long complicated way
to say”Give me what is between these two points”.
With one done, we do much the similar steps for the rest of the categories.
# Maker
maker = html_elements(obj_info,xpath = "//h3[preceding-sibling::h3[contains(text(), 'Maker(s)')]][1]/preceding-sibling::*[preceding-sibling::h3[contains(text(), 'Maker(s)')]]")
html_text2(maker)
## [1] "Pottery: Unidentified Staffordshire Pottery"
# Id Numbers
id_numbers = html_elements(obj_info,xpath = "//h3[preceding-sibling::h3[contains(text(), 'Identification numbers')]][1]/preceding-sibling::*[preceding-sibling::h3[contains(text(), 'Identification numbers')]]")
html_text2(id_numbers)
## [1] "Accession number: EC.2-1941\nPrimary reference Number: 76905\nStable URI"
# Category
category = html_elements(obj_info,xpath = "//h3[preceding-sibling::h3[contains(text(), 'Categories')]][1]/preceding-sibling::*[preceding-sibling::h3[contains(text(), 'Categories')]]")
html_text2(category)
## [1] "Lead-glazed earthenware"
# Entities
entities = html_elements(obj_info,xpath = "//h3[preceding-sibling::h3[contains(text(), 'Entities')]][1]/preceding-sibling::*[preceding-sibling::h3[contains(text(), 'Entities')]]")
html_text2(entities)
## [1] "Pipe"
# Acquisition and important dates
acqu_dates = html_elements(obj_info,xpath = "//h3[preceding-sibling::h3[contains(text(),'Acquisition and important dates')]][1]/preceding-sibling::*[preceding-sibling::h3[contains(text(), 'Acquisition and important dates')]]")
html_text2(acqu_dates)
## [1] "Method of acquisition: Given (1941-03-22) by Partridge, Frank"
# Description
description = html_elements(obj_info,xpath = "//h3[preceding-sibling::h3[contains(text(), 'Description')]][1]/preceding-sibling::*[preceding-sibling::h3[contains(text(), 'Description')]]")
html_text2(description)
## [1] "Cream-coloured earthenware, moulded, and decorated with green, yellow, and yellowish-brown lead-glazes.\nWhieldon ware."
# Dating
dating = html_elements(obj_info,xpath = "//h3[preceding-sibling::h3[contains(text(), 'Dating')]][1]/preceding-sibling::*[preceding-sibling::h3[contains(text(), 'Dating')]]")
html_text2(dating)
## [1] "Third quarter of 18th century\nGeorge II\nGeorge III\nCirca 1755 CE - 1770 CE"
# Entities
entities = html_elements(obj_info,xpath = "//h3[preceding-sibling::h3[contains(text(), 'Entities')]][1]/preceding-sibling::*[preceding-sibling::h3[contains(text(), 'Entities')]]")
html_text2(entities)
## [1] "Pipe"
# Materials used in production
materials = html_elements(obj_info,xpath = "//h3[preceding-sibling::h3[contains(text(), 'Materials used in production')]][1]/preceding-sibling::*[preceding-sibling::h3[contains(text(), 'Materials used in production')]]")
html_text2(materials)
## [1] "Lead-glaze\nEarthenware"
Next, we need to deal with the last missing element: School or Style.
If you run the above code with it changed to ‘School or Style’, it
returns an empty container. Going back to the code, it appears there is
an additional <a> nested deeper. We will use
a different function html_element() (with no ‘s’) that selects a single
element and an xpath that finds us the elements descended from (nested
within), the
<p> element that follows the
<h3> with “School or Style” text.
school = html_element(obj_info,xpath = "//*/descendant::p[preceding-sibling::h3[contains(text(), 'School or Style')]]")
html_text2(school)
## [1] "Rococo"
We have retrieved all the elements we want from this one page.
Clean up our text with gsub
Automate using a for-loop again and a list outside to store our entries.
We aren’t going run our for-loop yet. We need to talk about “What ifs”. When you are running a webscraper, it is highly indiviudalised to the website you are dealing. The problems you come across will have to be handled one at a time and ass you do more webscraping you can anticipate some of them, but largely they are simply smalls puzzles to solve on you way to getting your code running. Thankfully I am not going to make you do the process of running your code until an error occurs. Instead, I am going to present you with some “what if this went wrong?” questions and how we solve them.
What if you aren’t retrieving the element you need? While the xpath for dating worked on the page we used, it returned empty on too many others and checking the code resulted in showing that they are similar to the School or Style xpath, so we switch to that Dating What if a page doesn’t exist? While you collected a url for all the objects, that doesn’t mean all of them are actually working. In this case, we are going to a trycatch to run the read_html() and if it fails and results in an error. It will store it for us to look at and move on. What if there are more than one elements returned? Often the description section is stored as separate elements. We can use a ifelse statment to check and fix. If the object length is greater than 1 then it will combine the text together. The else statment continues in the next question. What if there is no elements? Once we have checked for more than one element, we then can do another ifelse statement within the else statement for the previous. The if statement will cover, if there is just one normal element, and the else will cover if there are none returned, which will replace with NA values.
And that is it! These are repeated for all elements, which could be written more elegantly (aka shorter), but it does not matter and we understand it.
We finish by creating an id number by increasing the count for every time we go through the loop. Then we take all our results and put it in a list then add it to our existing empty list. A list within a list. We finally remember to use sys.sleep() to add a second delay, before repeating.
Once the for-loop is complete and we have our list of lists. We are going to turn it into a data frame which is a type of object that is made up of rows and columns. We then give names to our data frame columns.And finally save it!
WARNING: If you increase the number of pages and objects you wish to scrape, please know that increases the length of the runtime. All 255 pages will take approximately 5.5 hours to run.
The full script today without the initial steps is available on moodle.
If you have any questions, contact Leah at lmb211@cam.ac.uk.
All code and outputs are available on Leah’s Github: https://github.com/lmbrainerd/CDH_culturaldata
Thanks to Simon Carrignon for the assistance and answering of silly questions.